Summary
This leverages the
gibberish-detector
with a model trained on all RFCs (at the time of writing). I wanted a corpus that was reasonably representative of computer science jargon: after all, we're looking at variable names and strings in code, and trying to determine whether they are actually secret values. The typical "Complete List of Sherlock Holmes" corpus won't cut it.
This commit also introduces a variety of minor bug fixes. This includes:
detect-secrets-hook
, which makes a lot more sense
Methodology
Turns out, you can download all RFCs to local disk for easier processing. So I did that, and ran this simple script to train the model.
then,
TBH, I'm pretty surprised how quickly it went.
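The training script itself is not reproduced here. As a rough illustration of the underlying idea only, the sketch below trains a character-bigram model on lines of text and scores a string by its average negative log-probability per character transition. This is not the gibberish-detector package's code or model format: the names `ALPHABET`, `normalize`, `train`, and `score`, the add-one smoothing, and the convention that a higher score means more gibberish-like are all assumptions made for the example.

```python
import math

# Hypothetical alphabet for the sketch: lowercase letters plus space.
ALPHABET = 'abcdefghijklmnopqrstuvwxyz '


def normalize(text):
    """Lowercase the text and keep only characters found in ALPHABET."""
    return [c for c in text.lower() if c in ALPHABET]


def train(corpus_lines):
    """Count character bigrams per line and return smoothed log-probabilities."""
    # Every count starts at 1 (add-one smoothing), so unseen pairs
    # still get a small nonzero probability.
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1

    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(n / total) for b, n in row.items()}
    return model


def score(model, text):
    """Average negative log-probability per transition.

    Assumed convention for this sketch: higher means more gibberish-like.
    Strings with fewer than two usable characters score 0.0.
    """
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    return -sum(model[a][b] for a, b in pairs) / len(pairs)
```

Trained on RFC text, a model of this kind would be expected to score strings built from common English character pairs lower than random-looking ones. How such scores compare with the 3.7 limit mentioned in this PR depends on the real model, which this sketch does not reproduce.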
Testing
I tested this model with all the
FALSE_POSITIVES
found in the KeywordDetector, and measured their scores. I also tested this with known secrets, and measured their scores. Finally, based on internal data, this was able to reduce false positive rates by a whopping ~60% -- which is pretty superb, IMO.
It looks like 3.7 is a pretty conservative bar for gibberish strings, based on this model (conservative defined here as preferring false positives to false negatives), with one false negative from the corpus (
dummy
).
Reviewers Notes
I need to bundle the
rfc.model
with the package, but I'm not sure whether I'm doing it right. I followed these instructions, and will test on test.pypi.org whether it works.
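Returning to the limit discussed under Testing: one way to sanity-check a candidate value such as 3.7 is to count, over labelled samples, how many non-secrets would still be flagged and how many known secrets would be missed. The sketch below is hypothetical: `evaluate_limit` is not part of detect-secrets, it assumes a string is treated as gibberish when its score is above the limit, and the sample scores in the usage lines are made up for illustration rather than taken from this PR's measurements.

```python
def evaluate_limit(samples, limit):
    """Count errors at a given limit.

    samples: iterable of (score, is_secret) pairs.
    Assumption for this sketch: score > limit means "flag as gibberish".
    """
    flagged_non_secrets = 0  # non-secrets that would still be reported
    missed_secrets = 0       # known secrets that would be filtered out
    for score, is_secret in samples:
        flagged = score > limit
        if flagged and not is_secret:
            flagged_non_secrets += 1
        elif not flagged and is_secret:
            missed_secrets += 1
    return {
        'flagged_non_secrets': flagged_non_secrets,
        'missed_secrets': missed_secrets,
    }


# Made-up scores, purely for illustration.
samples = [(2.0, False), (4.5, False), (5.0, True), (3.0, True)]
result = evaluate_limit(samples, 3.7)
# result == {'flagged_non_secrets': 1, 'missed_secrets': 1}
```

Sweeping `limit` over a range of values and comparing the two counts gives a quick view of the trade-off that "conservative" refers to above.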